How did things go?
Our client, Kroger, is interested in a model that explains the factors influencing sales of their most popular soup products. Let’s define a couple of different models that might fit the story of how soup sales are determined. How would we determine which model is the most appropriate?
# Load packages.
library(tidyverse)
library(tidymodels)
Remember we can pretend that our model fits our story exactly and simulate data.
# What's this again?!
set.seed(42)

# Set variable and parameter values.
nobs <- 300
intercept <- 30
slope <- 9

# Simulate data.
sim_data_01 <- tibble(
  discount = round(runif(nobs, min = 0, max = 20)),
  sales = intercept + slope * discount + rnorm(nobs, mean = 0, sd = 5)
)

sim_data_01
## # A tibble: 300 × 2
##    discount sales
##       <dbl> <dbl>
##  1       18 192. 
##  2       19 193. 
##  3        6  89.8
##  4       17 182. 
##  5       13 145. 
##  6       10 114. 
##  7       15 165. 
##  8        3  53.0
##  9       13 144. 
## 10       14 162. 
## # ℹ 290 more rows
sim_data_01 |> 
  ggplot(aes(x = discount, y = sales)) + 
  geom_jitter(size = 2, alpha = 0.5) + 
  geom_smooth(method = "lm")
We can use the binomial distribution to simulate binary data. In a binary variable there are two levels, with the level equal to zero known as the baseline level or reference level.
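As a quick base-R illustration (separate from the simulation below), `rbinom()` with `size = 1` draws Bernoulli trials, so every simulated value is a 0 or a 1 and the sample proportion of 1s approximates the success probability we set:

```r
# Draw many Bernoulli trials with a 70% success probability.
set.seed(42)
x <- rbinom(10000, size = 1, prob = 0.7)

# Every draw is either 0 (the baseline/reference level) or 1.
unique(x)

# The proportion of 1s should be close to 0.7.
mean(x)
```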
# Simulate some more data.
sim_data_02 <- tibble(
  coupon = rbinom(nobs, size = 1, prob = 0.7),
  sales = intercept + slope * coupon + rnorm(nobs, mean = 0, sd = 5)
)

sim_data_02
## # A tibble: 300 × 2
##    coupon sales
##     <int> <dbl>
##  1      1  41.1
##  2      1  43.9
##  3      1  43.2
##  4      0  26.7
##  5      1  46.8
##  6      1  30.9
##  7      1  43.3
##  8      1  36.4
##  9      1  29.4
## 10      1  29.7
## # ℹ 290 more rows
sim_data_02 |> 
  ggplot(aes(x = coupon, y = sales)) + 
  geom_jitter(size = 2, alpha = 0.5) + 
  geom_smooth(method = "lm")
Remember that when we fit a linear model we are finding the line of best fit and getting parameter estimates.
# Fit the first model.
fit_01 <- linear_reg() |>
set_engine("lm") |>
fit(sales ~ discount, data = sim_data_01)
# Fit the second model.
fit_02 <- linear_reg() |>
set_engine("lm") |>
fit(sales ~ coupon, data = sim_data_02)
Remember that our goal is to use the model to estimate the unobserved parameters from the data.
# Evaluate model fit.
fit_01 |> tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159
## 2 discount        9.01    0.0466     193.  2.46e-315
\[sales = 29.80 + 9.01 \times discount\]
If \(x\) is continuous: the slope is the expected change in \(y\) for a one-unit increase in \(x\). Here, each additional dollar of discount is associated with about 9.01 more units of sales.
What’s different about sim_data_02?
# Evaluate model fit.
fit_02 |> tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    29.6     0.518       57.2 1.00e-162
## 2 coupon          9.07    0.617       14.7 3.13e- 37
\[sales = 29.60 + 9.07 \times coupon\]
If \(x\) is discrete: the slope is the expected difference in \(y\) between that level and the baseline (reference) level. Here, consumers sent a coupon buy about 9.07 more units than those who weren't.
See example in the starter code.
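One way to see the discrete interpretation (a base-R sketch, not the starter code itself): with a single 0/1 predictor, the fitted slope is exactly the difference between the two group means.

```r
# Simulate a binary predictor and an outcome, mirroring sim_data_02.
set.seed(42)
coupon <- rbinom(300, size = 1, prob = 0.7)
sales <- 30 + 9 * coupon + rnorm(300, mean = 0, sd = 5)

# The fitted slope on a 0/1 variable...
fit <- lm(sales ~ coupon)
coef(fit)["coupon"]

# ...equals the difference in group means.
mean(sales[coupon == 1]) - mean(sales[coupon == 0])
```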
So far the parameter estimates have been point estimates, a single number that represents our best guess.
But this is a statistical model – there is always uncertainty (i.e., error). We can produce an interval estimate of the parameters, a range of numbers that represent our best guess.
# Include confidence intervals.
tidy(fit_01, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159    28.7      30.9 
## 2 discount        9.01    0.0466     193.  2.46e-315     8.92      9.10
These interval estimates are called confidence intervals. Given our model and the data, they are our best guess of the unobserved parameters.
If the confidence interval doesn’t include zero, we can say that the parameter estimate is statistically significant (a.k.a., significant or significantly different from zero).
We reach the same conclusion using a confidence interval as when using a p-value!
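A base-R sketch of that equivalence, using simulated data like `sim_data_01`: at the 5% level, the 95% confidence interval for a coefficient excludes zero exactly when its p-value is below 0.05.

```r
# Simulate data and fit a simple linear model.
set.seed(42)
x <- runif(300, min = 0, max = 20)
y <- 30 + 9 * x + rnorm(300, mean = 0, sd = 5)
fit <- lm(y ~ x)

# 95% confidence interval for the slope.
ci <- confint(fit)["x", ]

# Two-sided p-value for the slope.
p <- summary(fit)$coefficients["x", "Pr(>|t|)"]

# The interval excludes zero if and only if p < 0.05.
(ci[1] > 0 || ci[2] < 0) == (p < 0.05)
```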
# Include confidence intervals.
tidy(fit_02, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.6     0.518       57.2 1.00e-162    28.6      30.7 
## 2 coupon          9.07    0.617       14.7 3.13e- 37     7.86     10.3 
Eventually we’ll want to compare how well different models fit the same data. To do that efficiently we need a single number that describes the overall model fit.
# Look at the r.squared.
glance(fit_01)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.992         0.992  4.73    37436. 2.46e-315     1  -891. 1787. 1798.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The \(R^2\) is the proportion of variation in \(y\) that can be explained by the model (i.e., by the explanatory variables). Closer to 1 is better! It is easy to misuse, so it will serve as our measure of overall model fit only temporarily.
Think of it this way: if the model explains none of the variation in \(y\), \(R^2 = 0\); if it explains all of it, \(R^2 = 1\).
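To make the definition concrete, here is a base-R sketch (using data simulated like `sim_data_01`) computing \(R^2\) by hand as one minus the ratio of residual variation to total variation, and checking it against the value R reports:

```r
# Simulate data and fit a simple linear model.
set.seed(42)
x <- runif(300, min = 0, max = 20)
y <- 30 + 9 * x + rnorm(300, mean = 0, sd = 5)
fit <- lm(y ~ x)

# R^2 = 1 - (residual sum of squares / total sum of squares).
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
1 - rss / tss

# Matches the r.squared reported by summary() (and by glance()).
summary(fit)$r.squared
```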
While the parameter estimates are often the object of interest for an inferential model, we of course can predict the outcome using the fitted model:
\[sales = 29.60 + 9.07 \times coupon\]
To predict the outcome we need new data that represents the counterfactuals we’d like to predict to feed into the fitted model.
# Column names need to match the fitted model.
scenarios <- tibble(coupon = c(0, 1))
We can predict using the fitted model and the new data.
# Predict and bind on the new data.
predict(fit_02, new_data = scenarios) |> 
  bind_cols(scenarios)
## # A tibble: 2 × 2
##   .pred coupon
##   <dbl>  <dbl>
## 1  29.6      0
## 2  38.7      1
Remember that we have more than point estimates. We can compute interval estimates for predictions as well, called prediction intervals (these account for both parameter uncertainty and residual error, so they are wider than confidence intervals).
# Predict and bind on prediction intervals.
bind_cols(
  predict(fit_02, new_data = scenarios),
  predict(fit_02, new_data = scenarios, type = "pred_int"),
  scenarios
)
## # A tibble: 2 × 4
##   .pred .pred_lower .pred_upper coupon
##   <dbl>       <dbl>       <dbl>  <dbl>
## 1  29.6        20.0        39.3      0
## 2  38.7        29.1        48.3      1
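For comparison, base R's `predict()` produces the same kind of prediction interval via `interval = "prediction"` (a sketch with freshly simulated data, not the tidymodels fit above):

```r
# Simulate data like sim_data_02 and fit with lm() directly.
set.seed(42)
coupon <- rbinom(300, size = 1, prob = 0.7)
sales <- 30 + 9 * coupon + rnorm(300, mean = 0, sd = 5)
fit <- lm(sales ~ coupon)

# Predict both counterfactual scenarios with a 95% prediction interval.
scenarios <- data.frame(coupon = c(0, 1))
predict(fit, newdata = scenarios, interval = "prediction", level = 0.95)
```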
One of the purposes of simulating data from a model is to evaluate whether a model can recover the true parameter values. This can be used to evaluate what happens when our model/story does not accurately describe how our data came to be. You’ll see an example of this in the Exercise for this week.
For now, look at the coefficients from our first model. Did it recover the true values?
tidy(fit_01, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159    28.7      30.9 
## 2 discount        9.01    0.0466     193.  2.46e-315     8.92      9.10
Imagine our client, Kroger, is interested in evaluating the effectiveness of a recent promotional campaign. Kroger had their in-house analytics team build a linear regression model for sales as a function of discount and coupon, and the team concluded that coupons were ineffective at driving sales. However, partway through analyzing the data, Kroger discovered that their coupon system had malfunctioned: customers they thought had received digital coupons never got them. Only paper mailers, sent to their older demographic of customers, actually went out.
Kroger is seeking your help to determine what the effects of this system glitch could have been on the results of their model.
slope_discount <- 3
slope_received_coupon <- 2

# Simulate data.
sim_data_campaign <- tibble(
  discount = round(runif(nobs, min = 0, max = 20)),
  coupon = rbinom(nobs, size = 1, prob = 0.5),
  mail = rbinom(nobs, size = 1, prob = 0.25),
  received_coupon = coupon * mail,
  sales = intercept + slope_discount * discount + 
    slope_received_coupon * received_coupon + 
    rnorm(nobs, mean = 0, sd = 5)
)

sim_data_campaign
## # A tibble: 300 × 5
##    discount coupon  mail received_coupon sales
##       <dbl>  <int> <int>           <int> <dbl>
##  1       11      1     0               0  66.6
##  2       13      0     1               0  66.5
##  3        7      1     0               0  53.4
##  4        2      1     0               0  38.9
##  5       20      1     1               1  86.9
##  6       13      1     1               1  70.7
##  7        0      1     1               1  29.3
##  8        7      1     0               0  53.1
##  9       11      1     0               0  55.5
## 10       11      1     0               0  70.7
## # ℹ 290 more rows
campaign_model <- linear_reg() |>
set_engine("lm") |>
fit(sales ~ discount + coupon, data = sim_data_campaign)
campaign_model |> tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   30.0      0.647      46.4  4.78e-138
## 2 discount       2.99     0.0475     62.9  1.08e-173
## 3 coupon         0.896    0.570       1.57 1.17e- 1
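To see why the glitch matters, we can refit using the variable that actually affected sales, `received_coupon`, instead of `coupon`. The base-R sketch below re-tells the same story (with a larger sample than the slides use, so the pattern is clear): regressing on `coupon` tends to dilute the effect toward zero, because many "couponed" customers never received anything, while regressing on `received_coupon` targets the true parameter (2).

```r
# Simulate the campaign story with a larger sample.
set.seed(42)
n <- 10000
discount <- round(runif(n, min = 0, max = 20))
coupon <- rbinom(n, size = 1, prob = 0.5)
mail <- rbinom(n, size = 1, prob = 0.25)
received_coupon <- coupon * mail   # only mailed coupons arrived
sales <- 30 + 3 * discount + 2 * received_coupon + rnorm(n, mean = 0, sd = 5)

# Regressing on coupon (as Kroger's team did) dilutes the effect...
coef(lm(sales ~ discount + coupon))["coupon"]

# ...while regressing on received_coupon recovers the true effect.
coef(lm(sales ~ discount + received_coupon))["received_coupon"]
```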
Summary
Next Time
Supplementary Material
Artwork by @allison_horst
Suppose, hypothetically, that soup sales are determined entirely by discount, the amount (in dollars and cents) that the price of a product is discounted, and coupon, whether a consumer was sent a digital coupon for a buy-one, get-one free deal. Kroger is interested in evaluating whether the discount or the coupon had a larger effect on sales.
Kroger’s couponing system had a glitch that caused the coupons not to be sent to consumers. However, this glitch was discovered only after Kroger’s analysts had examined the data using a linear regression model for sales as a function of discount and coupon. Your task is to simulate data that matches this scenario and determine what impact the couponing error had on the model that Kroger trained.
Simulate discount using a uniform distribution, coupon as a binary variable using the binomial distribution, and sales as a function of only discount, using a linear model. Remember: we expect coupon to have an effect on sales, but due to the system glitch we are simulating data where coupon does not affect sales, since consumers never received their coupons! Then compare the models sales ~ discount and sales ~ discount + coupon.